The tidyverse package is a very popular and widely used package in R that loads a collection of related packages. Loading it into your environment gives you tools designed for working with “tidy” data and makes it much easier to manipulate data frames in that format. Tidy data is any data frame or table in which each row represents one observation and each column represents one variable measured for each observation (almost every data frame we have created up to this point counts as a tidy data frame). Many datasets out there are not in tidy format, and in those cases you must first reshape the data into tidy form before you can manipulate it with these tools (we will cover how to do that in later lessons).
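Two built-in datasets are shown below for comparison. co2 is a time series stored as a single run of values rather than a table with one column per variable, so it is not tidy as it stands. BOD is a small data frame with one row per measurement and one column per variable.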
data("co2")
head(co2)
## [1] 315.42 316.31 316.50 317.56 318.13 318.00
data("BOD")
head(BOD)
##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8
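To make the definition of tidy data concrete, here is a minimal sketch that turns the co2 time series into a data frame with one row per monthly measurement. The object name co2_tidy and the column names year and month are made up for this illustration, and the sketch assumes the series covers whole calendar years (January through December).

```r
# co2 ships with R as a monthly time series, not a data frame
data("co2")

# One row per monthly measurement, one column per variable.
# Assumption: the series starts in January and ends in December.
years <- start(co2)[1]:end(co2)[1]
co2_tidy <- data.frame(
  year  = rep(years, each = 12),   # calendar year of each measurement
  month = as.integer(cycle(co2)),  # position within the year (1 to 12)
  co2   = as.numeric(co2)          # the measured value
)

head(co2_tidy)
```

Each row of co2_tidy is now a single observation, which is the shape that functions such as filter and mutate expect.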
This section will cover some of the core functions found within the tidyverse: mutate, filter, and select, all of which come from the dplyr package.
The mutate function allows us to add new columns with very little code. It takes the data frame we want as its first argument, and the name and values of the new variable as its second argument, written in the “name = values” format. We will practice adding a new variable to the data set below.
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(dplyr)
library(purrr)
# import data set from previous lesson
setwd("/Volumes/GoogleDrive-115381348121898517757/My Drive/All School Files/USC PHD/Files/Non-Class Material/UCLA Summer Course - Intro to Data Science/Datasets")
diabetes <- read_csv("diabetes.csv")
## Rows: 768 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# creating a new variable called "geriatric" using an ifelse function
diabetes_v2 <- mutate(diabetes, geriatric = ifelse(Age > 35, 1,0))
# Traditional way of adding a variable to the dataset
diabetes$geriatric <- ifelse(diabetes$Age > 35, 1,0)
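The two approaches above can be compared on a small example. The following is a minimal sketch using a made-up tibble called toy (the values are invented and are not from diabetes.csv). It suggests that mutate returns a new object and leaves its input untouched, while the `$` approach changes the data frame it is applied to.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(Age = c(21, 36, 50, 35), BMI = c(28.1, 22.4, 33.6, 25.0))

# mutate() returns a new tibble and leaves `toy` unchanged
toy_v2 <- mutate(toy, geriatric = ifelse(Age > 35, 1, 0))

# Base R equivalent: work on a copy so the original stays intact
toy_base <- toy
toy_base$geriatric <- ifelse(toy_base$Age > 35, 1, 0)

# Both approaches produce the same new column
identical(toy_v2$geriatric, toy_base$geriatric)

# The original toy data still has only two columns
ncol(toy)
```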
Suppose that we want to filter the data table to show only the entries for which the BMI is 23 or higher. To do this we use the filter function, which takes the data table as its first argument and the conditional statement as its second. As with mutate, we can use the unquoted variable names from diabetes inside the function, and it will know we mean the columns and not objects in the workspace.
# filtering through BMI
filter(diabetes, BMI >= 23)
## # A tibble: 707 × 10
##    Pregnan…¹ Glucose Blood…² SkinT…³ Insulin   BMI Diabe…⁴   Age Outcome geria…⁵
##        <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
##  1         6     148      72      35       0  33.6   0.627    50       1       1
##  2         1      85      66      29       0  26.6   0.351    31       0       0
##  3         8     183      64       0       0  23.3   0.672    32       1       0
##  4         1      89      66      23      94  28.1   0.167    21       0       0
##  5         0     137      40      35     168  43.1   2.29     33       1       0
##  6         5     116      74       0       0  25.6   0.201    30       0       0
##  7         3      78      50      32      88  31     0.248    26       1       0
##  8        10     115       0       0       0  35.3   0.134    29       0       0
##  9         2     197      70      45     543  30.5   0.158    53       1       1
## 10         4     110      92       0       0  37.6   0.191    30       0       0
## # … with 697 more rows, and abbreviated variable names ¹Pregnancies,
## #   ²BloodPressure, ³SkinThickness, ⁴DiabetesPedigreeFunction, ⁵geriatric
## # ℹ Use `print(n = ...)` to see more rows
BMI <- filter(diabetes, BMI >= 23)
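filter also accepts more than one condition. Below is a minimal sketch on a made-up tibble (the name toy and its values are hypothetical): conditions separated by commas must all be true, while `|` keeps rows that satisfy either condition.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(Age = c(21, 36, 50, 35), BMI = c(28.1, 22.4, 33.6, 25.0))

# Comma-separated conditions are combined with AND
older_high_bmi <- filter(toy, BMI >= 23, Age > 35)

# Use | when either condition is enough
either <- filter(toy, BMI >= 30 | Age < 25)

older_high_bmi
either
```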
Although our data table was imported with only 9 columns, some data tables include hundreds. If we want to work with just a few, we can use the dplyr select function. In the code below we select three columns, assign the result to a new object, and then filter the new object.
new_diabetes <- select(diabetes, Age, BMI, Glucose)
filter(new_diabetes, BMI >=23)
## # A tibble: 707 × 3
##      Age   BMI Glucose
##    <dbl> <dbl>   <dbl>
##  1    50  33.6     148
##  2    31  26.6      85
##  3    32  23.3     183
##  4    21  28.1      89
##  5    33  43.1     137
##  6    30  25.6     116
##  7    26  31        78
##  8    29  35.3     115
##  9    53  30.5     197
## 10    30  37.6     110
## # … with 697 more rows
## # ℹ Use `print(n = ...)` to see more rows
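The select-then-filter steps can also be chained in a single pipeline. This is a minimal sketch on a made-up tibble (toy and its values are hypothetical), using the same native pipe `|>` that appears later in this lesson. It also shows that a minus sign in select drops a column instead of keeping it.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  Age     = c(21, 36, 50, 35),
  BMI     = c(28.1, 22.4, 33.6, 25.0),
  Glucose = c(89, 110, 148, 95),
  Insulin = c(94, 0, 0, 60)
)

# Keep three columns, then keep the rows with BMI of 23 or higher
result <- toy |>
  select(Age, BMI, Glucose) |>
  filter(BMI >= 23)

# A minus sign drops a column and keeps everything else
no_insulin <- select(toy, -Insulin)

result
names(no_insulin)
```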
# arrange() sorts this new dataset by Age from youngest to oldest;
# tail() then shows the last six rows, which are the oldest entries
new_diabetes |>
  arrange(Age) |>
  tail()
## # A tibble: 6 × 3
##     Age   BMI Glucose
##   <dbl> <dbl>   <dbl>
## 1    68  35.6      91
## 2    69  26.8     132
## 3    69   0       136
## 4    70  32.5     145
## 5    72  19.6     119
## 6    81  25.9     134
# if you want descending order of Age
new_diabetes |>
  arrange(desc(Age)) |>
  head()
## # A tibble: 6 × 3
##     Age   BMI Glucose
##   <dbl> <dbl>   <dbl>
## 1    81  25.9     134
## 2    72  19.6     119
## 3    70  32.5     145
## 4    69  26.8     132
## 5    69   0       136
## 6    68  35.6      91
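When two rows share the same Age, as the two 69-year-olds do above, arrange can take a second column to break the tie. A minimal sketch on a made-up tibble (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with a tie in Age
toy <- tibble(Age = c(69, 81, 69, 72), BMI = c(26.8, 25.9, 0, 19.6))

# Sort by Age from oldest to youngest; within the same Age,
# sort by BMI from lowest to highest
sorted <- arrange(toy, desc(Age), BMI)

sorted
```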
# if we want to group by a specific variable, in this case, geriatric, we can do the following
diabetes_v2 |> group_by(geriatric)
## # A tibble: 768 × 10
## # Groups:   geriatric [2]
##    Pregnan…¹ Glucose Blood…² SkinT…³ Insulin   BMI Diabe…⁴   Age Outcome geria…⁵
##        <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
##  1         6     148      72      35       0  33.6   0.627    50       1       1
##  2         1      85      66      29       0  26.6   0.351    31       0       0
##  3         8     183      64       0       0  23.3   0.672    32       1       0
##  4         1      89      66      23      94  28.1   0.167    21       0       0
##  5         0     137      40      35     168  43.1   2.29     33       1       0
##  6         5     116      74       0       0  25.6   0.201    30       0       0
##  7         3      78      50      32      88  31     0.248    26       1       0
##  8        10     115       0       0       0  35.3   0.134    29       0       0
##  9         2     197      70      45     543  30.5   0.158    53       1       1
## 10         8     125      96       0       0   0     0.232    54       1       1
## # … with 758 more rows, and abbreviated variable names ¹Pregnancies,
## #   ²BloodPressure, ³SkinThickness, ⁴DiabetesPedigreeFunction, ⁵geriatric
## # ℹ Use `print(n = ...)` to see more rows
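On its own, group_by only marks the groups (note the "Groups" line in the output) and leaves the rows unchanged. Grouping becomes useful when it is followed by a function such as summarise, which then computes one result per group. A minimal sketch on a made-up tibble (toy, its values, and the summary column names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  geriatric = c(1, 0, 0, 1, 0),
  Glucose   = c(148, 85, 183, 197, 89)
)

# One row per group: how many entries, and their average Glucose
summary_tbl <- toy |>
  group_by(geriatric) |>
  summarise(n = n(), mean_glucose = mean(Glucose))

summary_tbl
```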
# case_when() is a multi-way version of ifelse from dplyr. Conditions are checked in order,
# which lets us create or define categorical variables within our dataset
x <- c(-2,-1,0,1,2)
case_when(x < 0 ~ "Negative",
          x > 0 ~ "Positive",
          TRUE ~ "Zero")
## [1] "Negative" "Negative" "Zero" "Positive" "Positive"
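case_when is typically used inside mutate to build a categorical column. A minimal sketch on a made-up tibble follows. The name toy, the column bmi_group, and the cutoffs (18.5, 25, 30) are assumptions chosen for illustration and are not taken from this lesson's dataset. Because conditions are checked in order, each row gets the label of the first condition it satisfies.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(BMI = c(17.0, 22.4, 28.1, 33.6))

# Illustrative cutoffs (an assumption, not from the lesson);
# the first matching condition wins
toy <- mutate(toy, bmi_group = case_when(
  BMI < 18.5 ~ "Underweight",
  BMI < 25   ~ "Normal",
  BMI < 30   ~ "Overweight",
  TRUE       ~ "Obese"
))

toy
```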
The following section will be taught a little differently. The code chunks will be provided and you will follow along and program with me throughout the activities as I explain what each function does.
library(dplyr)
library(ggplot2) # we load both packages necessary to begin plotting
ggplot(data = diabetes)
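Called with only a data argument, ggplot draws an empty panel, because it has not yet been told which variables to map or which shapes to draw. The following minimal sketch uses the built-in mtcars data so that it runs without diabetes.csv. The object names p_empty and p_points are made up for this illustration.

```r
library(ggplot2)

# Data only: an empty canvas with nothing drawn on it
p_empty <- ggplot(data = mtcars)

# Map wt to the x axis and mpg to the y axis, then add a layer of points
p_points <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

print(p_empty)
print(p_points)
```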
# the datasets package ships with R, so it does not need to be installed
library(datasets)
data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# one dimensional plot is one where you plot one single variable at a time
boxplot(mtcars$mpg, col= "green")
hist(mtcars$mpg, col = "green", breaks = 25) ## Plot 2
hist(mtcars$mpg, col = "green", breaks = 50) ## Plot 3
barplot(table(mtcars$carb), col="grey")
# Two dimensional plots
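# one box of mpg per number of cylinders (a grouping variable such as cyl suits a boxplot better than the continuous wt)
boxplot(mpg ~ cyl, data = mtcars, col = "grey")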
hist(subset(mtcars, cyl == 4)$mpg, col = "green")
with(mtcars, plot(wt, mpg))
# Using the plot function in r
plot(3, 4)
plot(c(1, 3, 4), c(4, 5 , 8))
plot(1:20)
# Values for x and y axis
x <- 1:5; y <- x * x
# Using plot() function
plot(x, y, type = "l") # l stands for line
plot(x, y, type = "h") # h draws histogram-like vertical lines
# R program to plot a graph
# Creating x and y-values
x <- 1:5; y <- x * x
# Using plot function
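plot(x, y, type = "b") # b stands for both points and lines
plot(x, y, type = "s") # s stands for stair steps
plot(x, y, type = "p") # p stands for points (the default)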
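To compare the type options used above, the plots can be drawn next to each other. This is a minimal sketch. The types vector and its descriptions are written for this illustration, and par(mfrow = ...) is used to split the plotting window into a 2-by-3 grid.

```r
x <- 1:5
y <- x * x

# Type codes used in this lesson, with a short description of each
types <- c(
  p = "points",
  l = "lines",
  b = "both",
  h = "vertical lines",
  s = "stair steps"
)

# Split the plotting window into a 2 x 3 grid and remember the old settings
old_par <- par(mfrow = c(2, 3))

for (type_code in names(types)) {
  plot(x, y, type = type_code,
       main = paste0("type = ", type_code, " (", types[[type_code]], ")"))
}

# Restore the previous plotting settings
par(old_par)
```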
The following chunks of code are an example of how the tools we have covered, such as loops and lists, can be combined to create beautiful plots. Credit to https://towardsdatascience.com/christmas-cards-81e7e1cce21c for showcasing this code.
# install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
set.seed(24)
n_tree <- 1000
n_ornaments <- 20
n_lights <- 300
# Generate spiral data points
x <- c()
y <- c()
z <- c()
for (i in 1:n_tree) {
  r <- i / 30
  x <- c(x, r * cos(i / 30))
  y <- c(y, r * sin(i / 30))
  z <- c(z, n_tree - i)
}
tree <- data.frame(x, y, z)
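The loop above grows x, y, and z one element at a time. Because R's arithmetic and trigonometric functions work on whole vectors, the same spiral can be built without a loop. The sketch below is an alternative to the credited code, not part of it. It rebuilds the loop version under the name tree_loop and compares it with a vectorised version called tree_vec (both names are made up for this comparison).

```r
n_tree <- 1000

# Loop version, following the code above
x <- c()
y <- c()
z <- c()
for (i in 1:n_tree) {
  r <- i / 30
  x <- c(x, r * cos(i / 30))
  y <- c(y, r * sin(i / 30))
  z <- c(z, n_tree - i)
}
tree_loop <- data.frame(x, y, z)

# Vectorised version: compute every point in one step
i <- 1:n_tree
r <- i / 30
tree_vec <- data.frame(
  x = r * cos(r),
  y = r * sin(r),
  z = n_tree - i
)

# Check that both versions describe the same spiral
all.equal(tree_loop, tree_vec)
```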
# Sample for ornaments:
# - sample n_ornaments points from the tree spiral
# - modify z so that the ornaments are below the line
# - color column: optional, add if you want to add color range to ornaments
ornaments <- tree[sample(nrow(tree), n_ornaments), ]
ornaments$z <- ornaments$z - 50
ornaments$color <- 1:nrow(ornaments)
# Sample for lights:
# - sample n_lights points from the tree spiral
# - Add normal noise to x, y, and z so the lights spread out
lights <- tree[sample(nrow(tree), n_lights), ]
lights$x <- lights$x + rnorm(n_lights, 0, 20)
lights$y <- lights$y + rnorm(n_lights, 0, 20)
lights$z <- lights$z + rnorm(n_lights, 0, 20)
# hide axes
ax <- list(
  title = "",
  zeroline = FALSE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = FALSE
)
plot_ly() %>%
  add_trace(data = tree, x = ~x, y = ~y, z = ~z,
            type = "scatter3d", mode = "lines",
            line = list(color = "#1A8017", width = 7)) %>%
  add_markers(data = ornaments, x = ~x, y = ~y, z = ~z,
              type = "scatter3d",
              marker = list(color = ~color,
                            colorscale = list(c(0,'#EA4630'), c(1,'#CF140D')),
                            size = 15)) %>%
  add_markers(data = lights, x = ~x, y = ~y, z = ~z,
              type = "scatter3d",
              marker = list(color = "#FDBA1C", size = 3, opacity = 0.8)) %>%
  layout(scene = list(xaxis=ax, yaxis=ax, zaxis=ax), showlegend = FALSE)